Data Import and Preparation
We import the dataset SP500_data.csv and make a working copy named data, so that no changes are made to the original dataset.
We use several libraries to process the tasks and produce the requested output.
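A minimal sketch of that step: the real call would be read.csv on the file named above (the path is an assumption), and the toy data frame below stands in for the CSV to show why working on the copy protects the original.

```r
# The real import would be: data <- read.csv("SP500_data.csv")
# (file name as given in the text; working-directory path is an assumption).
raw <- data.frame(full_text = c("tweet one", "tweet two"),
                  retweet_count = c(0L, 3L))
data <- raw               # work on the copy
data$retweet_count <- 0L  # modify the copy only
identical(raw$retweet_count, c(0L, 3L))  # original unchanged: TRUE
```

R's copy-on-modify semantics mean the assignment `data <- raw` is enough; the original object is never touched by later edits to `data`.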
Data Exploration
This section gives a concise overview of the tweets from the Swiss university social media accounts. The dataset consists of 19,575 observations and 14 variables:
Time Range and Tweet Frequency:
- Tweets span September 29, 2009, to January 26, 2023, indicating long-term use of Twitter
- The median tweet date is April 13, 2018, suggesting that half of the tweets were posted after this date and that the distribution is skewed
Retweet and Favorite Counts:
- The data shows a minimum of 0 and a maximum of 267 retweets and 188 likes per tweet
- The first quartile for both retweets and likes is 0 and the median favorite count is 0 (median retweet count: 1), indicating that many tweets receive little to no engagement
- The in_reply_to_screen_name field suggests that some tweets are responses to other users, which might indicate engagement or conversation strategies used by the universities
ID and String Variables:
- The id and id_str fields are technical identifiers for tweets; their wide numeric range indicates that tweets were collected over a long period
Language and University Fields:
- lang shows the language of each tweet; university shows the abbreviation of the university
Temporal Patterns:
- created_at, tweet_date, tweet_hour, and tweet_month provide detailed temporal data that can be analyzed to understand peak activity times and seasonal or monthly trends in tweeting behavior.
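The derived temporal columns can be recomputed from created_at alone; a base-R sketch (the report may use lubridate instead, and the exact output formats here are assumptions):

```r
# Deriving the temporal helper columns from a single timestamp
created_at  <- as.POSIXct("2018-04-13 13:26:56", tz = "UTC")
tweet_date  <- as.Date(created_at)                  # calendar day
tweet_hour  <- as.integer(format(created_at, "%H")) # hour of day, 0-23
tweet_month <- format(created_at, "%Y-%m")          # year-month bucket
```

Grouping by tweet_hour or tweet_month then reduces to a simple aggregation over these columns.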
Content Analysis
The word cloud represents the most frequently used words in the filtered tweets with high engagement (likes or retweets). Key observations include:
- Frequent terms: larger words such as “bachelor,” “design,” “die,” “das,” “der,” and “amp” indicate their higher occurrence.
- Key topics: “bachelor” for Bachelor’s programs or graduates; “design” for design courses or projects; “HSLU” (Hochschule Luzern).
- General terms: “schweiz,” “zeigen,” “nicht.”
- Note: the term “amp” appears due to HTML encoding of “&” and is not meaningful.
## # A tibble: 6 × 14
## created_at id id_str full_text in_reply_to_screen_n…¹
## <dttm> <dbl> <chr> <chr> <chr>
## 1 2023-01-20 17:17:32 1.62e18 1616469988369469… "Im MSc … <NA>
## 2 2023-01-13 07:52:01 1.61e18 1613790954737074… "Was bew… <NA>
## 3 2023-01-12 19:30:01 1.61e18 1613604227141537… "Was uns… <NA>
## 4 2023-01-12 08:23:00 1.61e18 1613436367169634… "Eine di… <NA>
## 5 2023-01-11 14:00:05 1.61e18 1613158809081450… "Wir gra… <NA>
## 6 2023-01-10 17:06:11 1.61e18 1612843252083834… "Unsere … <NA>
## # ℹ abbreviated name: ¹in_reply_to_screen_name
## # ℹ 9 more variables: retweet_count <int>, favorite_count <int>, lang <chr>,
## # university <chr>, tweet_date <dttm>, tweet_minute <dttm>,
## # tweet_hour <dttm>, tweet_month <date>, timeofday_hour <chr>
## created_at id
## Min. :2009-09-29 14:29:47.0 Min. : 4468752018
## 1st Qu.:2015-01-28 15:07:41.5 1st Qu.: 560439073866000000
## Median :2018-04-13 13:26:56.0 Median : 984754806702000000
## Mean :2017-12-09 15:26:50.7 Mean : 939953703992000000
## 3rd Qu.:2020-10-20 10:34:50.0 3rd Qu.:1318470720360000000
## Max. :2023-01-26 14:49:31.0 Max. :1618607065240000000
## id_str full_text in_reply_to_screen_name
## Length:19575 Length:19575 Length:19575
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## retweet_count favorite_count lang university
## Min. : 0.000 Min. : 0.00 Length:19575 Length:19575
## 1st Qu.: 0.000 1st Qu.: 0.00 Class :character Class :character
## Median : 1.000 Median : 0.00 Mode :character Mode :character
## Mean : 1.289 Mean : 1.37
## 3rd Qu.: 2.000 3rd Qu.: 2.00
## Max. :267.000 Max. :188.00
## tweet_date tweet_minute
## Min. :2009-09-29 00:00:00.00 Min. :2009-09-29 14:29:00.00
## 1st Qu.:2015-01-28 00:00:00.00 1st Qu.:2015-01-28 15:07:00.00
## Median :2018-04-13 00:00:00.00 Median :2018-04-13 13:26:00.00
## Mean :2017-12-09 02:25:45.00 Mean :2017-12-09 15:26:24.68
## 3rd Qu.:2020-10-20 00:00:00.00 3rd Qu.:2020-10-20 10:34:30.00
## Max. :2023-01-26 00:00:00.00 Max. :2023-01-26 14:49:00.00
## tweet_hour tweet_month timeofday_hour
## Min. :2009-09-29 14:00:00.00 Min. :2009-09-01 Length:19575
## 1st Qu.:2015-01-28 14:30:00.00 1st Qu.:2015-01-01 Class :character
## Median :2018-04-13 13:00:00.00 Median :2018-04-01 Mode :character
## Mean :2017-12-09 14:59:43.81 Mean :2017-11-24
## 3rd Qu.:2020-10-20 10:00:00.00 3rd Qu.:2020-10-01
## Max. :2023-01-26 14:00:00.00 Max. :2023-01-01
Data Manipulation
Languages
Here we calculate the frequency of each language present in the
tweets dataset and sort these frequencies in descending order.
The
output indicates that German (de) is the most common language with
14,474 occurrences, followed by Italian (it) with 1,865 and French (fr)
with 1,792. English (en) comes next with 1,280 tweets. The frequencies
of other languages, including rare and less commonly used ones, are also
listed, showcasing the linguistic diversity in the dataset.
# Count the frequency of each language
lang_counts <- table(tweets$lang)
# Sort the language frequencies in descending order
sort(lang_counts, decreasing = TRUE)
##
## de it fr en qam qme es ca da ro nl in et
## 14474 1865 1792 1280 35 21 19 10 10 10 9 7 6
## und pt zxx art lv cy fi lt no qht cs eu ht
## 6 4 4 3 3 2 2 2 2 2 1 1 1
## ja sv tl tr
## 1 1 1 1
Because German, Italian, French and English are the most frequent languages, and the remaining languages neither appear in large numbers nor are among the most spoken languages in Switzerland, we limit the dataset to these four.
# Filter the DataFrame to keep only tweets in German, Italian, French and English
filtered_tweets <- tweets[tweets$lang %in% c("de", "it", "fr", "en"), ]
# Check the resulting language distribution
table(filtered_tweets$lang)
##
## de en fr it
## 14474 1280 1792 1865
This gives us the new summary of the dataset:
- Number of Records: The total count of tweets has decreased from 19,575 to 19,411, reflecting the 164 tweets in other languages that were filtered out.
- Date and Time: Minimal changes are reflected across the median and mean values.
- Other Attributes: No significant changes are observed in the ranges.
## created_at id
## Min. :2009-09-29 14:29:47.00 Min. : 4468752018
## 1st Qu.:2015-02-04 11:39:32.00 1st Qu.: 562923403041000000
## Median :2018-04-17 13:53:07.00 Median : 986210946744999936
## Mean :2017-12-11 15:27:49.55 Mean : 940675313339000064
## 3rd Qu.:2020-10-20 11:09:15.50 3rd Qu.:1318479385120000000
## Max. :2023-01-26 14:49:31.00 Max. :1618607065240000000
## id_str full_text in_reply_to_screen_name
## Length:19411 Length:19411 Length:19411
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## retweet_count favorite_count lang university
## Min. : 0.000 Min. : 0.000 Length:19411 Length:19411
## 1st Qu.: 0.000 1st Qu.: 0.000 Class :character Class :character
## Median : 1.000 Median : 0.000 Mode :character Mode :character
## Mean : 1.293 Mean : 1.376
## 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :267.000 Max. :188.000
## tweet_date tweet_minute
## Min. :2009-09-29 00:00:00.0 Min. :2009-09-29 14:29:00.00
## 1st Qu.:2015-02-04 00:00:00.0 1st Qu.:2015-02-04 11:39:00.00
## Median :2018-04-17 00:00:00.0 Median :2018-04-17 13:53:00.00
## Mean :2017-12-11 02:26:53.7 Mean :2017-12-11 15:27:23.56
## 3rd Qu.:2020-10-20 00:00:00.0 3rd Qu.:2020-10-20 11:09:00.00
## Max. :2023-01-26 00:00:00.0 Max. :2023-01-26 14:49:00.00
## tweet_hour tweet_month timeofday_hour
## Min. :2009-09-29 14:00:00.00 Min. :2009-09-01 Length:19411
## 1st Qu.:2015-02-04 11:30:00.00 1st Qu.:2015-02-01 Class :character
## Median :2018-04-17 13:00:00.00 Median :2018-04-01 Mode :character
## Mean :2017-12-11 15:00:42.28 Mean :2017-11-26
## 3rd Qu.:2020-10-20 10:30:00.00 3rd Qu.:2020-10-01
## Max. :2023-01-26 14:00:00.00 Max. :2023-01-01
Emojis
The package emo is used for emoji analysis in R, which
is essential for text data that includes emojis. This is useful for
cleaning data, extracting information, or preparing text for further
analysis.
Understanding the prevalence of emojis can help analyze
sentiment, user engagement, or cultural trends in social media data.
# Install the emo package from GitHub for emoji analysis
if (!require("emo")) {
  remotes::install_github("hadley/emo")
}
## Loading required package: emo
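The emojis column used later in the report is presumably produced by an extraction step; independent of the emo package, a minimal base-R sketch looks like this (the Unicode ranges below are an approximation and an assumption, not the report's actual code, and do not cover every emoji):

```r
# Extract emoji-like characters from tweet text via a Unicode-range regex
# (approximate ranges: symbols/pictographs plus the classic dingbat block)
extract_emojis <- function(x) {
  regmatches(x, gregexpr("[\U0001F300-\U0001FAFF\u2600-\u27BF]", x, perl = TRUE))
}
extract_emojis("Gratulation \U0001F389 an unser Team \U0001F600")
```

A production pipeline would typically rely on a maintained emoji dictionary (as emo does) rather than hand-written ranges.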
Text Preprocessing
We create a text corpus from filtered_tweets$clean_text,
where each tweet is treated as a separate document.
The corpus
serves as the foundational structure for text analysis, allowing for
uniform processing and manipulation of the text data.
# Corpus: a collection of text documents that serves as the basis for text processing and text mining.
# VectorSource(): each entry of the character vector becomes a separate document in the corpus.
# Only the text column is passed in, since the corpus should contain text data only.
corpus <- Corpus(VectorSource(filtered_tweets$clean_text))
Here we clean the corpus by converting all text to lowercase,
removing punctuation, numbers, and stopwords from German, French,
Italian, and English, and finally stripping extra spaces.
Cleaning
the text is crucial for reducing noise and focusing analyses on
meaningful words only. This standardizes the text data, making
subsequent analyses like topic modeling or sentiment analysis more
effective and less prone to error due to textual inconsistencies.
# Clean text
corpus <- tm_map(corpus, content_transformer(tolower)) # Convert to lower case
corpus <- tm_map(corpus, removePunctuation) # Removing punctuation marks
corpus <- tm_map(corpus, removeNumbers) # Removing numbers
corpus <- tm_map(corpus, removeWords, stopwords("german")) # Removing stop words
corpus <- tm_map(corpus, removeWords, stopwords("french"))
corpus <- tm_map(corpus, removeWords, stopwords("italian"))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace) # Removal of additional spaces
corpus <- tm_map(corpus, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus <- tm_map(corpus, content_transformer(function(x) {
x <- gsub("–", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("amp", "", x, ignore.case = TRUE) # Remove 'amp' from HTML encoded '&'
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
Here we create a Document-Term Matrix (DTM) from the corpus, applying additional filters such as punctuation removal and stopword exclusion during matrix formation. We then filter out terms that appear in less than 1% of the documents to reduce sparsity. Reducing sparsity focuses the analysis on terms with a significant presence across documents, enhancing the reliability and performance of the statistical models and algorithms applied later.
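The dtm1 object used below is presumably built with tm::DocumentTermMatrix followed by tm::removeSparseTerms, mirroring the corpus2/corpus3 code later in the report. The mechanics can be illustrated on a toy corpus in base R (everything in this block is illustrative, not the report's actual code):

```r
# Toy document-term matrix: rows = documents, columns = terms
docs  <- c("digital innovation bern", "digital campus", "bern campus digital")
terms <- sort(unique(unlist(strsplit(docs, " "))))
dtm_toy <- t(sapply(strsplit(docs, " "),
                    function(w) table(factor(w, levels = terms))))
# Sparsity filter: keep only terms present in more than 1/3 of the documents
keep <- colSums(dtm_toy > 0) / nrow(dtm_toy) > 1/3
dtm_toy <- dtm_toy[, keep, drop = FALSE]
colnames(dtm_toy)  # "innovation" is dropped as too sparse
```

With tm, the same idea is one call: removeSparseTerms(dtm, sparse = 0.99) drops every term absent from more than 99% of documents.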
Tweet Analysis
Frequency
The function returns a vector of terms that meet the specified frequency threshold. In this case, terms such as “schweizer”, “bfh”, “neuen”, “emoji”, and others are listed, indicating they are common within the dataset. By setting a high frequency threshold (e.g., 25 occurrences), you can focus on terms that are more relevant across the dataset.
# Check the term frequencies
findFreqTerms(dtm1, lowfreq = 25) # Shows terms that occur at least 25 times
## [1] "dank" "innov" "schweizer" "unternehmen"
## [5] "arbeit" "dass" "zeigt" "bfh"
## [9] "neuen" "unser" "digital" "entwickelt"
## [13] "zukunft" "emoji" "startup" "mehr"
## [17] "neue" "gibt" "info" "berner"
## [21] "blogbeitrag" "statt" "sozial" "beim"
## [25] "menschen" "projekt" "bern" "geht"
## [29] "team" "schweiz" "dabei" "forschung"
## [33] "studi" "bfhhesb" "cus" "informatik"
## [37] "technik" "studierenden" "depart" "anmelden"
## [41] "onlin" "uhr" "thema" "heut"
## [45] "busi" "ab" "wurd" "zwei"
## [49] "swiss" "bachelor" "digit" "interview"
## [53] "immer" "digitalisierung" "findet" "institut"
## [57] "morgen" "zhaw" "dr" "herzlich"
## [61] "jahr" "erst" "hochschul" "kunst"
## [65] "rahmen" "fachhochschul" "mai" "srf"
## [69] "wünschen" "studierend" "master" "scienc"
## [73] "student" "design" "fhnw" "prof"
## [77] "hsafhnw" "manag" "via" "studium"
## [81] "luzern" "hslu" "social" "fhnwtechnik"
## [85] "htwchur" "cc" "hesso" "infoanlass"
## [89] "http" "projet" "chur" "htw"
## [93] "engineeringzhaw" "fhnwbusi" "supsi" "graubünden"
Words like “schweizer” (Swiss), “unternehmen” (companies), “zukunft”
(future), “innov” (innovation), and “digital” suggest that the text data
heavily revolves around themes of Swiss companies, innovation, and
digital advancements.
Frequent appearance of terms like “dank”
(thanks), “neue” (new), “mehr” (more), and “info” indicate common
communication patterns possibly related to news dissemination or updates
about new developments and initiatives.
set.seed(123)
# Term frequencies: colSums totals each term across documents
word_freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)
top_word_freq1 <- head(word_freq1, 80)
# Generate word cloud; names(top_word_freq1) keeps words and counts aligned
wordcloud(
words = names(top_word_freq1),
freq = top_word_freq1,
max.words = 80,
scale = c(4, 0.5), # Control for size of the most and least frequent words
random.order = FALSE, # Higher frequency words appear first
rot.per = 0.25, # Allows some rotation for fitting
colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
# Code to analyze tweet frequencies by time and university
p1<- filtered_tweets %>%
mutate(tweet_month = floor_date(created_at, "month")) %>%
group_by(university, tweet_month) %>%
summarize(count = n(), .groups = 'drop') %>%
ggplot(aes(x = tweet_month, y = count, fill = university)) +
geom_col(position = "dodge") +
theme_minimal() +
labs(title = "Monthly Tweet Frequency by University", x = "Year", y = "Number of Tweets")
# Convert to interactive plotly object
interactive_plot <- ggplotly(p1, tooltip = "text")
# Optionally, add configurations to enhance interaction
interactive_plot <- interactive_plot %>% layout(
hovermode = 'closest',
title = "Click on a University to see its Tweet Trends",
showlegend = TRUE
)
interactive_plot
High Engagement
This section sets a variable engagement_threshold to 20,
which is used as the minimum number of likes or retweets a tweet must
have to be considered as having “high engagement”. This threshold helps
to focus on tweets that have garnered more attention and
interaction.
# Set a threshold for "high engagement" (e.g., tweets with at least 20 likes or retweets)
engagement_threshold <- 20
# Filter tweets based on this engagement threshold
high_engagement_tweets <- filtered_tweets %>%
filter(favorite_count >= engagement_threshold | retweet_count >= engagement_threshold)
For the high_engagement_tweets we likewise clean the corpus by converting all text to lowercase, removing punctuation, numbers, and stopwords from German, French, Italian, and English, and finally stripping extra spaces, and we create a Document-Term Matrix (DTM) from this corpus.
# Rebuild the corpus with the sampled data
corpus2 <- Corpus(VectorSource(high_engagement_tweets$clean_text))
corpus2 <- tm_map(corpus2, content_transformer(tolower)) # Convert to lower case
corpus2 <- tm_map(corpus2, removePunctuation) # Removing punctuation marks
corpus2 <- tm_map(corpus2, removeNumbers) # Removing numbers
corpus2 <- tm_map(corpus2, removeWords, stopwords("german")) # Removing stop words
corpus2 <- tm_map(corpus2, removeWords, stopwords("french"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("italian"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("english"))
corpus2 <- tm_map(corpus2, stripWhitespace) # Removal of additional spaces
corpus2 <- tm_map(corpus2, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus2 <- tm_map(corpus2, content_transformer(function(x) {
x <- gsub("–", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("amp", "", x, ignore.case = TRUE) # Remove 'amp' from HTML encoded '&'
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
# Create DTM and remove sparse terms
dtm <- DocumentTermMatrix(corpus2, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm <- removeSparseTerms(dtm, sparse = 0.99) # Adjust sparsity threshold as needed
The word cloud effectively illustrates which topics are most engaging among tweets with at least 20 likes or retweets. This visualization can help refine communication and engagement strategies by focusing on the topics that naturally engage the audience.
- Words like “digital,” “data,” and “open” emphasize a strong focus on digital innovation and open data or technology. This suggests that tweets discussing digital technologies or data transparency tend to receive higher engagement.
- Terms such as “forscherteam” (research team), “univers” (universities), and “lab” indicate that the content related to academic research or laboratory work resonates well with the audience. This could be within a university setting or tech-related academic research.
- Words like “revolutionieren” (revolutionize), “entwickelt” (developed), and “chanc” (chances) suggest discussions around innovation and development are highly engaging.
- “Gespräch” (conversation/discussion) indicates that interactive or discussion-based tweets, perhaps those inviting comments or thoughts from the community, are among those that receive more likes and retweets.
- Words like “mithilf” (with help) and phrases possibly related to collaboration highlight the role of community involvement.
set.seed(123)
# Term frequencies: colSums totals each term across documents
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
top_word_freq <- head(word_freq, 80)
# Generate word cloud; names(top_word_freq) keeps words and counts aligned
wordcloud(
words = names(top_word_freq),
freq = top_word_freq,
max.words = 80,
scale = c(4, 0.5), # Control for size of the most and least frequent words
random.order = FALSE, # Higher frequency words appear first
rot.per = 0.25, # Allows some rotation for fitting
colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
# Analyze the frequency of different emojis
emoji_freq1 <- table(unlist(high_engagement_tweets$emojis))
sort(emoji_freq1, decreasing = TRUE)
##
## ➡️ 🇨🇭 ⤵️ ✨ 🇨🇳 🇬🇧 🇳🇱 🇸🇪 🇸🇬 👉 💛 📅 📢 🗞️ 😀 😉 🚊 🚨
## 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Engagement Analysis by University
The bar chart visualizes the total likes accumulated by the different universities among tweets with at least 20 likes or retweets, highlighting variations in engagement across these institutions on social media.
The visualization clearly shows which universities receive the most engagement in terms of likes. HSLU (Lucerne University of Applied Sciences and Arts) and ZHAW (Zurich University of Applied Sciences) stand out with the highest engagement, significantly more than other institutions. HSLU and ZHAW can thus serve as reference points for other institutions looking to refine their social media tactics.
# Analysis of likes and retweets
high_engagement_tweets %>%
group_by(university) %>%
summarize(total_likes = sum(favorite_count), total_retweets = sum(retweet_count), .groups = 'drop') %>%
ggplot(aes(x = reorder(university, total_likes), y = total_likes)) +
geom_col() +
coord_flip() +
labs(title = "Engagement Analysis by University", x = "University", y = "Total Likes")
HSLU & ZHAW Engagement Analysis
#Filter Tweets for HSLU and ZHAW
hslu_zhaw_tweets <- filtered_tweets %>%
filter(university %in% c("HSLU", "ZHAW"))
# Set a threshold for "high engagement" (e.g., tweets with at least 10 likes or retweets)
engagement_threshold1 <- 10
# Filter tweets based on this engagement threshold
hslu_zhaw_high_engagement_tweets <- hslu_zhaw_tweets %>%
filter(favorite_count >= engagement_threshold1 | retweet_count >= engagement_threshold1)
# Rebuild the corpus with the sampled data
corpus3 <- Corpus(VectorSource(hslu_zhaw_high_engagement_tweets$clean_text))
corpus3 <- tm_map(corpus3, content_transformer(tolower)) # Convert to lower case
corpus3 <- tm_map(corpus3, removePunctuation) # Removing punctuation marks
corpus3 <- tm_map(corpus3, removeNumbers) # Removing numbers
corpus3 <- tm_map(corpus3, removeWords, stopwords("german")) # Removing stop words
corpus3 <- tm_map(corpus3, removeWords, stopwords("french"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("italian"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("english"))
corpus3 <- tm_map(corpus3, stripWhitespace) # Removal of additional spaces
corpus3 <- tm_map(corpus3, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus3 <- tm_map(corpus3, content_transformer(function(x) {
x <- gsub("–", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("amp", "", x, ignore.case = TRUE) # Remove 'amp' from HTML encoded '&'
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
# Create DTM and remove sparse terms
dtm2 <- DocumentTermMatrix(corpus3, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm2 <- removeSparseTerms(dtm2, sparse = 0.99) # Adjust sparsity threshold as needed
set.seed(123)
# Term frequencies: colSums totals each term across documents
word_freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
top_word_freq2 <- head(word_freq2, 80)
# Generate word cloud; names(top_word_freq2) keeps words and counts aligned
wordcloud(
words = names(top_word_freq2),
freq = top_word_freq2,
max.words = 80,
scale = c(4, 0.5), # Control for size of the most and least frequent words
random.order = FALSE, # Higher frequency words appear first
rot.per = 0.25, # Allows some rotation for fitting
colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
# Analyze the frequency of different emojis
emoji_freq <- table(unlist(hslu_zhaw_high_engagement_tweets$emojis))
sort(emoji_freq, decreasing = TRUE)
##
## 👉 ➡️ ⚖️ 🇨🇭 🌍 🌳 👋 💛 💬 💻 📈 📰 😀 🤔 🥑 🥗
## 4 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1